Introduction

This comprehensive guide demonstrates two-dimensional functional clustering analysis using the MultiConnector package. Unlike one-dimensional analysis that focuses on a single measurement type, two-dimensional analysis simultaneously models multiple related measurements, providing richer insights into complex biological or clinical patterns.

Our case study is the MCL dataset, which contains two simultaneous measurements (PB and BM) tracked over time for each subject. This approach is particularly valuable for:

  • Multi-biomarker studies: Analyzing multiple biomarkers simultaneously
  • Clinical monitoring: Tracking multiple clinical parameters
  • Biological systems: Understanding coordinated responses across measurements
  • Complex phenotyping: Identifying patterns across multiple phenotypic measures

Background

Two-Dimensional Functional Clustering

Traditional clustering often treats each measurement independently, potentially missing important relationships between variables. Two-dimensional functional clustering:

  1. Preserves relationships: Maintains correlations between measurements
  2. Reduces dimensionality: Clusters based on joint patterns rather than individual variables
  3. Improves interpretation: Identifies coordinated patterns across measurements
  4. Enhances stability: Uses more information for robust clustering

The MCL Dataset

The MCL dataset contains:

  • Subjects: Multiple individuals tracked over time
  • Measurements: Two coordinated measures (PB and BM)
  • Annotations: Subject-level features (TTP, Arm, etc.)
  • Time structure: Longitudinal observations across time points
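
The long data format this structure implies can be sketched with a toy example (the values below are illustrative only, not the real MCL data; the column names match those the vignette uses later):

```r
# A minimal sketch of the expected long format: one row per observation,
# with measureID distinguishing the two coordinated measurements.
toy_series <- data.frame(
  subjID    = rep(c("S1", "S2"), each = 4),
  measureID = rep(c("PB", "PB", "BM", "BM"), times = 2),
  time      = rep(c(0, 30), times = 4),
  value     = c(1.2, 0.8, 3.4, 2.9, 1.5, 1.1, 3.0, 2.2)
)
# Each subject contributes one short series per measurement type
table(toy_series$subjID, toy_series$measureID)
```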

Package Setup and Data Loading

# Load required packages
library(MultiConnector)
library(tibble)
library(dplyr)
library(ggplot2)

# Load the MCL dataset
# This dataset contains two-dimensional time series data
TimeSeries <- readRDS(system.file("Data/MCL/TimeSeries.rds", package="MultiConnector"))
Annotations <- readRDS(system.file("Data/MCL/Annotations.rds", package="MultiConnector"))

# Display basic dataset information
cat("Dataset Overview:")
#> Dataset Overview:
cat("\n- Time series data dimensions:", dim(TimeSeries))
#> 
#> - Time series data dimensions: 1341 5
cat("\n- Annotations dimensions:", dim(Annotations))
#> 
#> - Annotations dimensions: 117 6
cat("\n- Unique subjects:", length(unique(TimeSeries$subjID)))
#> 
#> - Unique subjects: 117
cat("\n- Measurement types:", paste(unique(TimeSeries$measureID), collapse = ", "))
#> 
#> - Measurement types: PB, BM
cat("\n- Time range:", range(TimeSeries$time))
#> 
#> - Time range: 62 1666

Step 1: Data Object Creation

The first step is creating a CONNECTORData object that properly structures our two-dimensional time series data.

# Create the main data object for two-dimensional analysis
Data <- ConnectorData(tibble(TimeSeries), tibble(Annotations))
#> ############################### 
#> Data loaded...
#> Number of Observations:424
#> Number of distinct measures:2
#> Average length:6.325472
# Display comprehensive summary
summary(Data)
#>        Length         Class          Mode 
#>             1 CONNECTORData            S4
show(Data)
#> CONNECTORData object with:
#> - Subjects: 95 
#> - Measures: 2 
#> 
#> Lengths summary:
#>  Min. 1st Qu. Median Mean 3rd Qu. Max. 
#> 3 5 7 6.647368 8 9

Step 2: Initial Data Exploration

Comprehensive exploration of the two-dimensional time series structure and patterns.

Basic Time Series Visualization

# Plot 2.1: Overview of all time series (both measurements)
plot(Data)

This plot shows the raw time series for both PB and BM measurements, providing an initial view of the data structure and variability patterns.

Feature-Based Exploration

# Examine available features
available_features <- getAnnotations(Data)
cat("Available features for analysis:")
#> Available features for analysis:
print(available_features)
#> [1] "subjID"      "Arm"         "TTP"         "TTP_date"    "TTP_dateReg"
#> [6] "Dynamic"
# Plot 2.2: Time series colored by key features
# This helps identify potential relationships between features and temporal patterns

# TTP (Time To Progression) analysis
plot(Data, feature="TTP")

# Treatment arm analysis  
plot(Data, feature="Arm")

These visualizations reveal how different clinical features (TTP, treatment arms) may relate to the temporal patterns in our two measurements.

Temporal Structure Analysis

# Plot 2.3: Detailed time distribution analysis
# This is crucial for two-dimensional data to understand sampling patterns

# Comprehensive time analysis
plotTimes(Data, large=TRUE)

# Summary time analysis
plotTimes(Data, large=FALSE)

The time distribution analysis is particularly important for two-dimensional data as it reveals:

  • Sampling consistency: Whether both measurements are sampled similarly
  • Missing data patterns: Potential gaps in either measurement
  • Temporal coverage: The time span and density of observations
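
The same questions can be answered numerically with base R on the long-format table; a small sketch on toy data (the real analysis would use the `TimeSeries` tibble):

```r
# Sketch of the kind of summary plotTimes() conveys: observation counts
# and time coverage per measurement type. (Toy data for illustration.)
ts <- data.frame(
  subjID    = rep(c("S1", "S2", "S3"), each = 4),
  measureID = rep(c("PB", "BM"), times = 6),
  time      = c(0, 0, 30, 30, 0, 0, 60, 60, 0, 0, 90, 90)
)
# Number of observations per measurement
table(ts$measureID)
# Time span covered by each measurement
aggregate(time ~ measureID, data = ts, FUN = range)
```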

Step 3: Spline Dimension Estimation

For two-dimensional analysis, we need to determine optimal spline dimensions for both measurements simultaneously.

# Cross-validation for optimal spline basis dimensions
# This step is crucial as each measurement type may require different flexibility

cat("Estimating optimal spline dimensions...")
#> Estimating optimal spline dimensions...
cat("\nThis process may take several minutes for two-dimensional data...")
#> 
#> This process may take several minutes for two-dimensional data...
# Estimate dimensions for both measurements
# Testing p values from 2 to 6 for both PB and BM
file_crosslog = system.file("Data/MCL/MCLTwoD_CrossLog.rds", package="MultiConnector")
if(file.exists(file_crosslog)){
  CrossLogLikePlot = readRDS(file_crosslog)
} else {
  CrossLogLikePlot <- estimatepDimension(Data, p=2:6, cores=5)
  saveRDS(CrossLogLikePlot,
          file.path(system.file("Data/MCL", package="MultiConnector"),
                    "MCLTwoD_CrossLog.rds"))
}

# Display the results
CrossLogLikePlot
#> $PB

#> 
#> $BM

The dimension estimation plot shows cross-validation results for both measurements:

  • Separate curves: Each measurement (PB, BM) has its own optimal dimension
  • Validation pattern: Look for clear minima or plateaus in the cross-validation curves
  • Computational cost: Two-dimensional analysis requires more computation time
# Based on the cross-validation results, select optimal dimensions
# These values would typically be chosen based on the plotted results
optimal_p <- c("PB" = 4, "BM" = 4)

Selected optimal dimensions:

  • PB measurement: p = 4
  • BM measurement: p = 4

Dimension selection rationale:

  • Higher p: More flexible curves, captures fine details
  • Lower p: Smoother curves, reduces overfitting
  • Optimal p: Balance between flexibility and generalization
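
The trade-off behind the p grid can be explored by hand on a single toy curve. The sketch below uses base R (splines::ns plus lm with 5-fold cross-validation) in place of the package's likelihood-based estimatepDimension(); it only illustrates the bias/variance balance, not the package's actual procedure:

```r
library(splines)

# Cross-validated fit error for spline basis dimensions p = 2..6
# on one noisy toy curve.
set.seed(1)
t_obs <- seq(0, 1, length.out = 40)
y_obs <- sin(2 * pi * t_obs) + rnorm(40, sd = 0.2)

cv_error <- sapply(2:6, function(p) {
  folds <- sample(rep(1:5, length.out = length(t_obs)))
  mean(sapply(1:5, function(k) {
    train <- data.frame(t = t_obs, y = y_obs)[folds != k, ]
    fit   <- lm(y ~ ns(t, df = p), data = train)
    pred  <- predict(fit, newdata = data.frame(t = t_obs[folds == k]))
    mean((y_obs[folds == k] - pred)^2)   # held-out squared error
  }))
})
names(cv_error) <- paste0("p=", 2:6)
round(cv_error, 3)
```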

Step 4: Two-Dimensional Clustering Analysis

The core clustering step that simultaneously models both measurements to identify coordinated patterns.

Starting two-dimensional clustering analysis. Estimated computation time: ~5-10 minutes with 5 cores; time increases with the number of subjects and measurements.

file_cluster = system.file("Data/MCL/MCLTwoD_Clustering.rds", package="MultiConnector")
if(file.exists(file_cluster)){
  clusters = readRDS(file_cluster)
} else {
  # Perform clustering with multiple cluster numbers
  # This is computationally intensive for two-dimensional data
  clusters <- estimateCluster(Data,
                              G = 2:6,           # Test 2-6 clusters
                              p = optimal_p,     # Optimal spline dimensions for both measurements
                              runs = 20,         # Multiple runs for stability
                              cores = 5)         # Parallel processing
  saveRDS(clusters,
          file.path(system.file("Data/MCL", package="MultiConnector"),
                    "MCLTwoD_Clustering.rds"))
}
# Plot clustering quality metrics
plot(clusters)

The clustering quality plot shows:

  • fDB (functional Davies-Bouldin): Lower values indicate better cluster separation
  • Computational time: Processing time for each cluster configuration
  • Stability measures: Consistency across multiple runs

Key interpretations:

  • Optimal G: Look for elbow points or clear minima in fDB
  • Stability: Consistent results across runs indicate robust solutions
  • Computational scaling: Time increases with cluster number and data complexity
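
The Davies-Bouldin idea behind fDB can be sketched on toy cluster summaries: for each cluster, take the worst-case ratio of within-cluster scatter to between-centroid distance, then average; lower is better. (fDB in MultiConnector operates on fitted mean curves rather than point centroids; this only illustrates the index.)

```r
# Toy cluster centres in 2-D and toy within-cluster spread values
centroids <- rbind(c(0, 0), c(3, 0), c(0, 3))
scatter   <- c(0.5, 0.7, 0.6)

db_index <- mean(sapply(1:3, function(i) {
  max(sapply((1:3)[-i], function(j) {
    (scatter[i] + scatter[j]) /
      sqrt(sum((centroids[i, ] - centroids[j, ])^2))
  }))
}))
round(db_index, 3)
#> [1] 0.389
```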

Step 5: Cluster Selection and Validation

Select the optimal clustering configuration based on quality metrics.

# Select optimal configuration based on quality metrics
# For demonstration, using G=3 clusters with MinfDB criterion

optimal_G <- 3
selection_criterion <- "MinfDB"

cat("Cluster selection:")
#> Cluster selection:
cat("\n- Number of clusters (G):", optimal_G)
#> 
#> - Number of clusters (G): 3
cat("\n- Selection criterion:", selection_criterion)
#> 
#> - Selection criterion: MinfDB
cat("\n- Rationale: Minimum functional Davies-Bouldin index")
#> 
#> - Rationale: Minimum functional Davies-Bouldin index
# Create the final clustered data object
ClusterData <- selectCluster(clusters, G=optimal_G, selection_criterion)

cat("\n\nFinal clustering solution:")
#> 
#> 
#> Final clustering solution:
cat("\n- Clusters identified:", length(unique(ClusterData@cluster.names)))
#> 
#> - Clusters identified: 3
cat("\n- Cluster labels:", paste(ClusterData@cluster.names, collapse = ", "))
#> 
#> - Cluster labels: A, B, C

Step 6: Two-Dimensional Cluster Visualization

Comprehensive visualization of clustering results across both measurements.

Basic Cluster Visualization

# Plot 6.1: Basic cluster visualization across both measurements
plot(ClusterData)

This plot reveals:

  • Coordinated patterns: How PB and BM measurements cluster together
  • Cluster separation: Visual assessment of cluster distinctness
  • Measurement relationships: Correlations between the two measurements within clusters

Feature-Based Cluster Analysis

# Plot 6.2: Clusters colored by clinical features
# This reveals associations between clusters and clinical characteristics

# TTP (Time To Progression) patterns
plot(ClusterData, feature="TTP")

# Treatment arm associations
plot(ClusterData, feature="Arm")

These visualizations help identify:

  • Clinical relevance: Which clusters associate with clinical outcomes
  • Treatment effects: Whether treatment arms segregate into clusters
  • Prognostic value: Cluster relationships with progression times

Cluster Composition Analysis

# Analyze cluster composition across features
available_features <- getAnnotations(ClusterData)
cat("Features available for composition analysis:")
#> Features available for composition analysis:
print(available_features)
#> [1] "Arm"         "TTP"         "TTP_date"    "TTP_dateReg" "Dynamic"
# Detailed composition analysis for key features
cat("\n=== CLUSTER COMPOSITION ANALYSIS ===")
#> 
#> === CLUSTER COMPOSITION ANALYSIS ===
# TTP distribution across clusters
if ("TTP" %in% available_features) {
  cat("\n\n1. TTP Distribution Across Clusters:")
  ttp_dist <- clusterDistribution(ClusterData, feature="TTP")
  print(ttp_dist)
}
#> 
#> 
#> 1. TTP Distribution Across Clusters:# A tibble: 3 × 5
#>   TTP   `Cluster 1` `Cluster 2` `Cluster 3` Total
#>   <chr>       <int>       <int>       <int> <dbl>
#> 1 0               0           2          55    57
#> 2 1               5          13          20    38
#> 3 TOTAL           5          15          75    95
# Treatment arm distribution
if ("Arm" %in% available_features) {
  cat("\n\n2. Treatment Arm Distribution:")
  arm_dist <- clusterDistribution(ClusterData, feature="Arm")
  print(arm_dist)
}
#> 
#> 
#> 2. Treatment Arm Distribution:# A tibble: 4 × 5
#>   Arm   `Cluster 1` `Cluster 2` `Cluster 3` Total
#>   <chr>       <int>       <int>       <int> <dbl>
#> 1 0               1          10          39    50
#> 2 1               3           2          34    39
#> 3 NULL            1           3           2     6
#> 4 TOTAL           5          15          75    95
# Display cluster assignments
cluster_assignments <- getClusters(ClusterData)
cat("\n\n3. Cluster Assignment Summary:")
#> 
#> 
#> 3. Cluster Assignment Summary:
cat("\n- Total subjects clustered:", nrow(cluster_assignments))
#> 
#> - Total subjects clustered: 95
cluster_summary <- table(cluster_assignments$Cluster)
for (i in names(cluster_summary)) {
  cat("\n- Cluster", i, ":", cluster_summary[i], "subjects")
}
#> 
#> - Cluster 1 : 5 subjects
#> - Cluster 2 : 15 subjects
#> - Cluster 3 : 75 subjects

Step 7: Advanced Subject-Level Analysis

Detailed analysis of individual subjects within the two-dimensional clustering context.

Single Subject Analysis

# Select a representative subject for detailed analysis
example_subjects <- unique(cluster_assignments$subjID)[1:3]
selected_subject <- example_subjects[1]

cat("Detailed analysis for subject:", selected_subject)
#> Detailed analysis for subject: Subject 1105
# Comprehensive subject information
subject_info <- SubjectInfo(ClusterData, subjIDs = selected_subject)

cat("\nSubject cluster assignment:")
#> 
#> Subject cluster assignment:
cat("\n", subject_info$cluster_assignments)
#> 
#>  Subject Subject 1105 belongs to Cluster 3
# Display highlighted visualization
subject_info$highlighted_plot

Multi-Subject Comparison

# Compare multiple subjects across clusters
multi_subject_info <- SubjectInfo(ClusterData, subjIDs = example_subjects)

cat("Multi-subject comparison:")
#> Multi-subject comparison:
cat("\nSubjects analyzed:", paste(example_subjects, collapse = ", "))
#> 
#> Subjects analyzed: Subject 1105, Subject 1205, Subject 1206
# Show cluster assignments
for (i in 1:nrow(multi_subject_info$cluster_table)) {
  cat("\n- Subject", multi_subject_info$cluster_table$subjID[i], 
      "-> Cluster", multi_subject_info$cluster_table$cluster[i])
}
#> 
#> - Subject Subject 1105 -> Cluster 3
#> - Subject Subject 1205 -> Cluster 3
#> - Subject Subject 1206 -> Cluster 2
# Display comparative visualization
multi_subject_info$highlighted_plot

Step 8: Cluster Validation and Quality Assessment

Comprehensive validation of the two-dimensional clustering solution.

# Comprehensive cluster validation
cat("Performing cluster validation...")
#> Performing cluster validation...
validation_metrics <- validateCluster(ClusterData)

# Display validation plot
validation_metrics$plot

Validation Metrics Interpretation:

  • Silhouette Analysis: Measures how well subjects fit their assigned clusters
    • Values near +1: Well-clustered subjects
    • Values near 0: Borderline cases
    • Negative values: Potentially misclassified subjects
  • Entropy Analysis: Measures uncertainty in cluster assignments
    • Lower entropy: More confident assignments
    • Higher entropy: More uncertain assignments
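
Both ideas reduce to short formulas, sketched here on toy numbers: silhouette is (b - a) / max(a, b), with a the mean distance to the subject's own cluster and b to the nearest other cluster; entropy is -sum(p * log(p)) over a subject's cluster-membership probabilities (0 when the assignment is certain).

```r
# Silhouette for one toy subject
a <- 0.4; b <- 1.6                     # within / nearest-other mean distances
sil <- (b - a) / max(a, b)             # close to +1: well clustered
sil
#> [1] 0.75

# Entropy of confident vs uncertain membership probabilities
p_confident <- c(0.97, 0.02, 0.01)
p_uncertain <- c(0.40, 0.35, 0.25)
entropy <- function(p) -sum(p * log(p))
c(confident = entropy(p_confident), uncertain = entropy(p_uncertain))
```
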
# Extract and summarize validation metrics
if (!is.null(validation_metrics$metrics)) {
  cat("\n\nValidation Summary:")
  print(validation_metrics$metrics)
}

# Additional quality assessment
quality_metrics <- ClusterData@TTandfDBandSil
cat("\n\nClustering Quality Metrics:")
#> 
#> 
#> Clustering Quality Metrics:
print(quality_metrics)
#> # A tibble: 1 × 4
#>     fDB    TT     G   Sil
#>   <dbl> <dbl> <int> <dbl>
#> 1 0.788 3815.     3 0.671

Step 9: Advanced Visualizations and Interpretations

Specialized visualizations for understanding two-dimensional clustering patterns.

Discriminant Analysis

# Discriminant analysis visualization
# This projects the high-dimensional functional data into lower-dimensional space

cat("Generating discriminant analysis plots...")
#> Generating discriminant analysis plots...
# Basic discriminant plot
discr_basic <- DiscriminantPlot(ClusterData)
discr_basic$ColCluster

# Feature-enhanced discriminant plot
if ("TTP" %in% available_features) {
  discr_feature <- DiscriminantPlot(ClusterData, feature = "TTP")
  discr_feature$ColFeature
}

The discriminant plots show:

  • Cluster separation: How well clusters separate in reduced space
  • Feature relationships: How clinical features relate to cluster structure
  • Dimensionality: Effective dimensions needed for cluster separation
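
A rough analogue of such a projection can be built with MASS::lda: project feature vectors (here toy stand-ins for spline coefficients) onto the linear discriminant axes that best separate the cluster labels. This is not the package's actual DiscriminantPlot() computation, just the underlying idea.

```r
library(MASS)

# Two toy "clusters" of 4-dimensional coefficient vectors
set.seed(2)
coefs <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
               matrix(rnorm(40, mean = 3), ncol = 4))
cluster <- factor(rep(c("A", "B"), each = 10))

fit  <- lda(coefs, grouping = cluster)
proj <- predict(fit)$x          # subjects in discriminant space
head(proj, 3)
```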

Maximum Discrimination Analysis

Analyzing maximum discrimination functions.

# Maximum discrimination function analysis
# Identifies the most discriminative aspects of the functional data

max_discr <- MaximumDiscriminationFunction(ClusterData)
max_discr
#> $DiscrFunctionsPlot

#> 
#> $Separated

This analysis reveals:

  • Discriminative time periods: When clusters are most different
  • Measurement importance: Which measurement (PB/BM) drives separation
  • Functional patterns: The shapes that distinguish clusters

Step 10: Subclustering Analysis

Advanced analysis involving subclustering of identified clusters.

Performing subclustering analysis on Cluster 3.

# Get cluster assignments
cluster_assignments <- getClusters(ClusterData)

# Select subjects from cluster 3
cluster3_subjects <- cluster_assignments %>% 
  filter(Cluster == 3) %>% 
  pull(subjID)

cat("\n- Subjects in Cluster 3:", length(cluster3_subjects))
#> 
#> - Subjects in Cluster 3: 75
cat("\n- First few subjects:", paste(head(cluster3_subjects), collapse = ", "))
#> 
#> - First few subjects: Subject 1105, Subject 1205, Subject 1301, Subject 1304, Subject 1308, Subject 1309
# Create subset data for subclustering
subData <- ConnectorData(
  tibble(TimeSeries) %>% filter(subjID %in% cluster3_subjects), 
  tibble(Annotations) %>% filter(subjID %in% cluster3_subjects)
)
#> ############################### 
#> Data loaded...
#> Number of Observations:300
#> Number of distinct measures:2
#> Average length:7.046667
cat("\n\nSubset data for subclustering:")
#> 
#> 
#> Subset data for subclustering:
show(subData)
#> CONNECTORData object with:
#> - Subjects: 75 
#> - Measures: 2 
#> 
#> Lengths summary:
#>  Min. 1st Qu. Median Mean 3rd Qu. Max. 
#> 3 6 7 7.046667 9 9

Performing subclustering analysis. Estimated time: ~5-10 minutes

file_cluster = system.file("Data/MCL/MCLTwoD_SubClustering.rds", package="MultiConnector")
if(file.exists(file_cluster)){
  subClusters = readRDS(file_cluster)
} else {
  # Subcluster analysis within Cluster 3
  subClusters <- estimateCluster(subData,
                                 G = 2:5,           # Test fewer clusters for subset
                                 p = optimal_p,     # Use same spline dimensions
                                 runs = 20,         # Multiple runs for stability
                                 cores = 5)
  saveRDS(subClusters,
          file.path(system.file("Data/MCL", package="MultiConnector"),
                    "MCLTwoD_SubClustering.rds"))
}

# Plot subclustering results
plot(subClusters)

# Select optimal subclustering
subClusterData <- selectCluster(subClusters, G=3, "MinfDB")

Subclustering results:

# Basic subcluster visualization
plot(subClusterData, feature = "TTP")

# Subcluster composition
if ("TTP" %in% getAnnotations(subClusterData)) {
  subcluster_dist <- clusterDistribution(subClusterData, feature="TTP")
  cat("\n\nSubcluster composition (TTP):")
  print(subcluster_dist)
}
#> 
#> 
#> Subcluster composition (TTP):# A tibble: 3 × 5
#>   TTP   `Cluster 1` `Cluster 2` `Cluster 3` Total
#>   <chr>       <int>       <int>       <int> <dbl>
#> 1 0               0           2          55    57
#> 2 1               5          13          20    38
#> 3 TOTAL           5          15          75    95

Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin23.4.0
#> Running under: macOS 15.5
#> 
#> Matrix products: default
#> BLAS:   /usr/local/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib 
#> LAPACK: /usr/local/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Rome
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.2      dplyr_1.1.4        tibble_3.3.0       MultiConnector_1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1    viridisLite_0.4.2   farver_2.1.2       
#>  [4] viridis_0.6.5       fastmap_1.2.0       lazyeval_0.2.2     
#>  [7] promises_1.3.3      digest_0.6.37       mime_0.13          
#> [10] lifecycle_1.0.4     ellipsis_0.3.2      statmod_1.5.0      
#> [13] magrittr_2.0.3      compiler_4.4.1      sass_0.4.10        
#> [16] rlang_1.1.6         tools_4.4.1         utf8_1.2.6         
#> [19] yaml_2.3.10         geometry_0.5.2      data.table_1.16.4  
#> [22] knitr_1.50          labeling_0.4.3      htmlwidgets_1.6.4  
#> [25] pkgbuild_1.4.8      xml2_1.4.0          RColorBrewer_1.1-3 
#> [28] pkgload_1.4.0       abind_1.4-8         miniUI_0.1.2       
#> [31] withr_3.0.2         purrr_1.1.0         desc_1.4.3         
#> [34] grid_4.4.1          roxygen2_7.3.2      urlchecker_1.0.1   
#> [37] profvis_0.4.0       xtable_1.8-4        scales_1.4.0       
#> [40] MASS_7.3-65         isoband_0.2.7       dichromat_2.0-0.1  
#> [43] cli_3.6.5           rmarkdown_2.29      generics_0.1.4     
#> [46] remotes_2.5.0       rlist_0.4.6.2       rstudioapi_0.17.1  
#> [49] httr_1.4.7          tzdb_0.5.0          magic_1.6-1        
#> [52] commonmark_2.0.0    sessioninfo_1.2.3   readxl_1.4.5       
#> [55] cachem_1.1.0        stringr_1.5.1       splines_4.4.1      
#> [58] parallel_4.4.1      cellranger_1.1.0    vctrs_0.6.5        
#> [61] devtools_2.4.5      Matrix_1.7-3        jsonlite_2.0.0     
#> [64] hms_1.1.3           patchwork_1.3.2     plotly_4.11.0      
#> [67] tidyr_1.3.1         jquerylib_0.1.4     glue_1.8.0         
#> [70] stringi_1.8.7       gtable_0.3.6        later_1.4.4        
#> [73] pillar_1.11.0       htmltools_0.5.8.1   R6_2.6.1           
#> [76] gghalves_0.1.4      rprojroot_2.1.1     evaluate_1.0.4     
#> [79] shiny_1.11.1        lattice_0.22-7      MetBrewer_0.2.0    
#> [82] readr_2.1.5         RhpcBLASctl_0.23-42 memoise_2.0.1      
#> [85] httpuv_1.6.16       bslib_0.9.0         Rcpp_1.1.0         
#> [88] gridExtra_2.3       xfun_0.53           fs_1.6.6           
#> [91] usethis_3.1.0       pkgconfig_2.0.3

Appendix: Technical Details

Data Structure Requirements

For two-dimensional functional clustering, the data must contain:

  • subjID: Subject identifiers (consistent across measurements)
  • measureID: Measurement type identifiers (e.g., “PB”, “BM”)
  • time: Time points (can vary across subjects and measurements)
  • value: Measured values at each time point
  • Annotations: Subject-level features for interpretation
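
These requirements lend themselves to a quick sanity check before calling ConnectorData(). The helper below, check_long_format(), is hypothetical and not part of MultiConnector; it is a minimal sketch of the kind of validation one might run:

```r
# Verify that a time-series table has the columns and types the
# two-dimensional analysis expects. (Hypothetical helper.)
check_long_format <- function(ts) {
  required <- c("subjID", "measureID", "time", "value")
  missing_cols <- setdiff(required, names(ts))
  if (length(missing_cols) > 0)
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  if (!is.numeric(ts$time) || !is.numeric(ts$value))
    stop("'time' and 'value' must be numeric")
  invisible(TRUE)
}

ok <- data.frame(subjID = "S1", measureID = "PB", time = 0, value = 1.2)
check_long_format(ok)
```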

Algorithm Details

The two-dimensional clustering approach:

  1. Spline fitting: Fits natural cubic splines to each measurement separately
  2. Joint modeling: Combines spline coefficients across measurements
  3. Clustering: Groups subjects based on joint spline patterns
  4. Validation: Assesses quality using multi-dimensional metrics
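
Steps 1-3 can be caricatured in a few lines of base R: fit a natural-spline basis to each measurement separately, concatenate the coefficient vectors, and cluster the joint coefficients. MultiConnector fits a model-based mixture by EM rather than per-subject least squares plus k-means, so this sketch only mirrors the outline:

```r
library(splines)

set.seed(3)
t_grid <- seq(0, 1, length.out = 12)
# Step 1: per-measurement spline fit, keeping the basis coefficients
fit_coef <- function(y) coef(lm(y ~ ns(t_grid, df = 3)))

# Ten toy subjects in two latent groups (shifted curves)
subjects <- lapply(1:10, function(i) {
  shift <- ifelse(i <= 5, 0, 2)
  list(PB = sin(2 * pi * t_grid) + shift + rnorm(12, sd = 0.1),
       BM = cos(2 * pi * t_grid) + shift + rnorm(12, sd = 0.1))
})

# Step 2: joint modelling -- concatenate PB and BM coefficients
joint <- t(sapply(subjects, function(s) c(fit_coef(s$PB), fit_coef(s$BM))))

# Step 3: cluster subjects on the joint coefficient vectors
km <- kmeans(joint, centers = 2, nstart = 10)
km$cluster
```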

Parameter Selection Guidelines

  • Spline dimensions (p): Start with 3-5, adjust based on cross-validation
  • Cluster numbers (G): Test 2-8 clusters, select based on quality metrics
  • Runs: Use 20-50 runs for stability assessment
  • Cores: Use 4-8 cores for optimal performance

This comprehensive guide demonstrates the power of two-dimensional functional clustering for complex longitudinal data analysis. The MultiConnector package provides robust tools for identifying coordinated patterns across multiple measurements, enabling deeper insights into biological and clinical processes.